2025-11-06
AI can be great for helping us with hard stuff that we need to do but either don’t want to do or aren’t sure how to do right:
In all examples, I can verify that an AI solution works.
For research purposes, statistics and data analysis can qualify as “hard stuff we need to do but we’re not sure how to do right.”
We can ask AI for help with stats, but how do we verify that its answer is correct?
Mostly safe AI tasks…
Potentially dangerous AI tasks…
I will demonstrate some of the dangers today.
Two most asked questions:
Balanced one-way analysis of variance power calculation
k = 3
n = 94.48714
f = 0.25
sig.level = 0.01
power = 0.9
NOTE: n is number in each group
Notice this is a much larger sample size, over 4 times larger than what Claude tells us.
library(pwr)
# Parameters
k <- 3        # number of groups (fuel types)
alpha <- 0.01 # significance level
power <- 0.90 # desired power
f <- 0.25     # Cohen's f for medium effect size
# Calculate sample size per group
result1 <- pwr.anova.test(k = k, f = f, sig.level = alpha, power = power)
result1
Balanced one-way analysis of variance power calculation
k = 3
n = 94.48714
f = 0.25
sig.level = 0.01
power = 0.9
NOTE: n is number in each group
The code confirms you need 95 trials per fuel type, not 22.
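One way to check a result like this without relying on the AI or on a single package is to recompute the power directly from the noncentral F distribution in base R. The sketch below assumes the standard balanced one-way ANOVA setup with noncentrality parameter k * n * f^2. The helper name anova_power() is made up for illustration.

```r
# Power of a balanced one-way ANOVA for a given Cohen's f, base R only.
# Assumption: noncentrality parameter lambda = k * n * f^2.
anova_power <- function(n, k, f, alpha) {
  df1 <- k - 1
  df2 <- k * (n - 1)
  crit <- qf(alpha, df1, df2, lower.tail = FALSE)  # rejection cutoff
  pf(crit, df1, df2, ncp = k * n * f^2, lower.tail = FALSE)
}

power_22 <- anova_power(n = 22, k = 3, f = 0.25, alpha = 0.01)
power_94 <- anova_power(n = 94, k = 3, f = 0.25, alpha = 0.01)
power_95 <- anova_power(n = 95, k = 3, f = 0.25, alpha = 0.01)
round(c(n22 = power_22, n94 = power_94, n95 = power_95), 3)
```

Power should fall just short of 0.90 at n = 94 and reach it at n = 95, consistent with the n = 94.49 reported above. At 22 per group it is far lower.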
Again, Claude gives the wrong answer: the power it reports is too high.
library(pwr)
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05,
           type = "two.sample", alternative = "two.sided")
Two-sample t test power calculation
n = 30
d = 0.5
sig.level = 0.05
power = 0.4778965
alternative = two.sided
NOTE: n is number in *each* group
Power is 0.48. This study is much less powerful than Claude states.
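A power calculation can also be checked by brute force: simulate the study many times and count how often the test rejects. The sketch below assumes normally distributed outcomes with a standard deviation of 1 in both groups, so the mean difference equals Cohen's d. The seed and the number of simulations are arbitrary choices.

```r
set.seed(2025)  # arbitrary seed so the run is reproducible

n_per_group <- 30
d <- 0.5        # standardized mean difference (SD = 1 in both groups)
n_sims <- 5000

# Simulate the study many times and record whether p < 0.05.
# var.equal = TRUE requests the pooled-variance two-sample t test.
rejected <- replicate(n_sims, {
  g1 <- rnorm(n_per_group, mean = 0, sd = 1)
  g2 <- rnorm(n_per_group, mean = d, sd = 1)
  t.test(g1, g2, var.equal = TRUE)$p.value < 0.05
})

sim_power <- mean(rejected)  # proportion of simulated studies that reject
sim_power
```

The estimate should land near the 0.48 computed by pwr.t.test(), give or take simulation noise of roughly 0.01.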
The last part of its answer:
This is also wrong. For 80% power we need 64 students per group, not 32.
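The required sample size can be cross-checked with power.t.test() from base R, which needs no extra packages. This is a sketch assuming a two-sample, two-sided test with a standard deviation of 1, so delta corresponds to Cohen's d = 0.5.

```r
# Solve for n per group (power.t.test() defaults to a two-sample,
# two-sided test). delta / sd plays the role of Cohen's d.
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
n_needed <- ceiling(res$n)  # round up to whole students
n_needed

# Power actually achieved with 32 per group, for comparison
power_32 <- power.t.test(n = 32, delta = 0.5, sd = 1,
                         sig.level = 0.05)$power
power_32
```

Rounding up gives 64 per group. With 32 per group the power is well short of the 80% target.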
No, it used the wrong sample size per group (60 instead of 30).
This is wrong: the answer is much too big, by a factor of 5.
NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
NB: Assuming 0.05 margin of error in estimation of intercept
NB: Events per Predictor Parameter (EPP) assumes prevalence = 0.17
           Samp_size Shrinkage Parameter CS_Rsq Max_Rsq Nag_Rsq  EPP
Criteria 1       623     0.900        24  0.288   0.598   0.481 4.41
Criteria 2       667     0.906        24  0.288   0.598   0.481 4.72
Criteria 3       217     0.906        24  0.288   0.598   0.481 1.54
Final            667     0.906        24  0.288   0.598   0.481 4.72
Minimum sample size required for new model development based on user inputs = 667,
with 114 events (assuming an outcome prevalence = 0.17) and an EPP = 4.72
Minimum sample size is about 667.
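The output above was presumably produced by a call along the lines of pmsampsize(type = "b", csrsquared = 0.288, parameters = 24, prevalence = 0.17). As a rough independent check, the three criteria can be recomputed by hand in base R. The formulas below are my understanding of the closed-form expressions behind each criterion, so treat them as assumptions. The package itself may differ in detail, and the helper name n_for_shrinkage() is made up.

```r
# Inputs read off the output above
cs_rsq <- 0.288   # anticipated Cox-Snell R-squared
p      <- 24      # candidate predictor parameters
prev   <- 0.17    # outcome prevalence

# Sample size needed for a target shrinkage factor S (assumed formula)
n_for_shrinkage <- function(S) p / ((S - 1) * log(1 - cs_rsq / S))

# Criterion 1: expected shrinkage of 0.9
n1 <- ceiling(n_for_shrinkage(0.9))

# Criterion 2: apparent and adjusted R-squared differ by at most 0.05,
# measured relative to the maximum possible Cox-Snell R-squared
max_rsq <- 1 - (prev^prev * (1 - prev)^(1 - prev))^2
S2 <- cs_rsq / (cs_rsq + 0.05 * max_rsq)
n2 <- ceiling(n_for_shrinkage(S2))

# Criterion 3: estimate the overall risk to within +/- 0.05
n3 <- ceiling((qnorm(0.975) / 0.05)^2 * prev * (1 - prev))

n_final <- max(n1, n2, n3)
events <- ceiling(n_final * prev)
c(criterion1 = n1, criterion2 = n2, criterion3 = n3,
  final = n_final, events = events)
```

If these assumed formulas are right, the values should reproduce the Samp_size column and the event count reported above.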
Claude gives the right code (almost). The rsquared argument should be csrsquared.
After providing mostly correct R code, Claude says it “should give you n = 3,661 as the minimum sample size.” That’s wrong, but everything else is correct.
This is correct!
This is very wrong. That’s 4 times too high.
This is the correct code.
This is wrong.
An odds ratio of 3 implies P(diabetes|X = 1) = 0.25.
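The arithmetic behind this conversion is short enough to check in base R. The sketch assumes a baseline probability P(diabetes|X = 0) of 0.1, the value passed as p0 in the call that follows.

```r
p0 <- 0.10                    # assumed baseline: P(diabetes | X = 0)
odds_ratio <- 3

odds0 <- p0 / (1 - p0)        # baseline odds, 1/9
odds1 <- odds_ratio * odds0   # odds when X = 1, 1/3
p1 <- odds1 / (1 + odds1)     # convert odds back to a probability
p1                            # 0.25

# The same calculation on the logit scale
beta0 <- qlogis(p0)           # intercept: log odds at X = 0
beta1 <- log(odds_ratio)      # slope: log odds ratio
plogis(beta0 + beta1)         # 0.25
```

The values of beta0 and beta1 computed this way are the ones reported in the WebPower output.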
library(WebPower)
wp.logistic(n = NULL, p0 = 0.1, p1 = 0.25,
            alpha = 0.05, power = 0.8,
            alternative = "two.sided",
            family = "Bernoulli",
            parameter = 0.3)
Power for logistic regression
   p0   p1     beta0    beta1        n alpha power
  0.1 0.25 -2.197225 1.098612 218.8331  0.05   0.8
URL: http://psychstat.org/logistic
We need to sample about 220 subjects.
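A simulation offers a package-free sanity check on this figure. The sketch below assumes that parameter = 0.3 in the call above means P(X = 1) = 0.3 for a binary predictor, and it tests the slope with the Wald p-value from glm(). The seed and the number of simulations are arbitrary choices.

```r
set.seed(2025)  # arbitrary seed for reproducibility

n <- 220        # total sample size to check
p_x <- 0.3      # assumed P(X = 1)
p0 <- 0.10      # P(Y = 1 | X = 0)
p1 <- 0.25      # P(Y = 1 | X = 1)
n_sims <- 1000

rejected <- replicate(n_sims, {
  x <- rbinom(n, size = 1, prob = p_x)
  y <- rbinom(n, size = 1, prob = ifelse(x == 1, p1, p0))
  fit <- suppressWarnings(glm(y ~ x, family = binomial))
  # Wald test p-value for the slope
  coef(summary(fit))["x", "Pr(>|z|)"] < 0.05
})

sim_power_logistic <- mean(rejected)  # proportion of rejections
sim_power_logistic
```

If about 220 subjects is right, the simulated power should come out in the neighborhood of 0.8, allowing for simulation noise.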
This is also wrong. Right package but nonexistent function.
[1] 223
I had to read the documentation of the powerMediation package, find the correct function (SSizeLogisticBin()), and then figure out how to use the function.
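Reading the documentation also makes it possible to recompute the answer by hand. The sketch below implements the sample size formula as I understand it from the SSizeLogisticBin() help page, so the formula itself is an assumption. The inputs are taken from the WebPower example, and the function name ss_logistic_bin() is made up.

```r
# Sample size for logistic regression with one binary predictor.
# Assumed formula, based on my reading of the SSizeLogisticBin() docs:
#   p1 = P(Y = 1 | X = 0), p2 = P(Y = 1 | X = 1), B = P(X = 1)
ss_logistic_bin <- function(p1, p2, B, alpha = 0.05, power = 0.8) {
  p_bar <- (1 - B) * p1 + B * p2   # overall event probability
  z_a <- qnorm(1 - alpha / 2)
  z_b <- qnorm(power)
  num <- (z_a * sqrt(p_bar * (1 - p_bar) / B) +
          z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2) * (1 - B) / B))^2
  den <- (p1 - p2)^2 * (1 - B)
  ceiling(num / den)               # round up to whole subjects
}

n_hand <- ss_logistic_bin(p1 = 0.10, p2 = 0.25, B = 0.3)
n_hand

# Package call expected to match (requires powerMediation):
# powerMediation::SSizeLogisticBin(p1 = 0.1, p2 = 0.25, B = 0.3,
#                                  alpha = 0.05, power = 0.8)
```

With these inputs the hand calculation agrees with the 223 shown above, a little higher than the WebPower result.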
After providing a wrong answer and wrong R code, it asks the following:
If it couldn’t handle a simple logistic regression study plan, why would I trust it with a more complex plan?
Let’s take it up on its offer.
It proceeded to supply sound advice.
The powerLogisticBin() function does exist, but it does not allow adjustment for covariates. Also, the example still uses the nonexistent ssize.logistic() function.
Friendly advice:
Depends on…
In the 1970s, the US Commission on Civil Rights examined charges by Chicago community organizations that insurance companies were redlining their neighborhoods.
To what extent does racial composition of a community affect underwriting practices after controlling for factors that legitimately affect underwriting such as theft and fire damage?
United States Commission on Civil Rights 1979 report:
Insurance Redlining: Fact Not Fiction
Julian Faraway reanalyzes these data in his book Linear Models with R (Ch 13).
A slightly modified version of this analysis is available at the following link:
https://static.lib.virginia.edu/statlab/materials/redlining_analysis.html
Copilot offers to analyze data for you.
Let’s give it a try!
When I let Copilot analyze the insurance redlining data:
Instead of letting Copilot guide the analysis, ask it to help with specific tasks:
Friendly advice:
General AI advice:
“…we strongly warn about the non-skeptical use of LLMs. Relying naively on the correctness of their output is irresponsible, and statistical studies can be seriously corrupted. Consequently, expert knowledge from biostatisticians remains indispensable, along with maintaining a questioning stance towards AI outputs.”
Dobler, et al. (2025). “ChatGPT as a Tool for Biostatisticians: A Tutorial on Applications, Opportunities, and Limitations,” Statistics in Medicine.
For statistics help, contact UVA Library StatLab: statlab@virginia.edu
Thank you to Jenn Huck, Hyeseon Seo, Lauren Brideau, and Ethan Kadiyala for suggestions that improved this presentation!
This work is licensed under a Creative Commons Attribution 4.0 International License.
Champely S (2020). pwr: Basic Functions for Power Analysis. https://doi.org/10.32614/CRAN.package.pwr, R package version 1.3-0, https://CRAN.R-project.org/package=pwr.
Crespi C (2025). Power and Sample Size in R. Chapman & Hall. https://powerandsamplesize.org/
Dobler D, Binder H, Boulesteix AL, et al. (2025). “ChatGPT as a Tool for Biostatisticians: A Tutorial on Applications, Opportunities, and Limitations,” Statistics in Medicine 44, no. 23-24: e70263, https://doi.org/10.1002/sim.70263.
Ensor J (2023). pmsampsize: Sample Size for Development of a Prediction Model. https://doi.org/10.32614/CRAN.package.pmsampsize, R package version 1.1.3, https://CRAN.R-project.org/package=pmsampsize.
Faraway J (2025). Linear Models with R, 3rd Ed. CRC Press. (Chapter 13)
Qiu W (2021). powerMediation: Power/Sample Size Calculation for Mediation Analysis. https://doi.org/10.32614/CRAN.package.powerMediation, R package version 0.3.4, https://CRAN.R-project.org/package=powerMediation.
R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Schwarz J (2025). “The use of generative AI in statistical data analysis and its impact on teaching statistics at universities of applied sciences,” Teaching Statistics 47: 118–128, https://doi.org/10.1111/test.12398.
U.S. Commission on Civil Rights (1979), Insurance Redlining: Fact not Fiction. A report prepared by the Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin Advisory Committees to the U.S. Commission on Civil Rights, Washington, D.C. https://www.usccr.gov/files/historical/1979/79-004.pdf.
Zhang Z, Mai Y (2023). WebPower: Basic and Advanced Statistical Power Analysis. https://doi.org/10.32614/CRAN.package.WebPower, R package version 0.9.4, https://CRAN.R-project.org/package=WebPower.
Zhu B (2025). “Biostatisticians Meet AI: Navigating Shifts While Preserving Principles,” Statistics in Medicine 44, no. 20-22: e70271, https://doi.org/10.1002/sim.70271.